Transform FERC-714 load forecast table #3670
Conversation
@zaneselvans Been looking around at the tests in the repo. Correct me if I'm wrong, but it seems like no new unit tests are needed here. My changes just rely on schema enforcement, which seems to already be tested. But I'm wondering about adding a validation script for FERC 714, since we don't have one yet.
Thanks for putting this together! I made some minor naming & Dagster infrastructure comments.
Have you done any plotting or other investigation that might highlight data quality issues? For a visual sanity check on the data, would you be up for using it to recreate this plot from RMI, from this article?
src/pudl/transform/ferc714.py (outdated)

```python
    io_manager_key="parquet_io_manager",
    op_tags={"memory-use": "high"},  # Should this be high?
    compute_kind="pandas",
)
```
- I believe this asset should use the `pudl_io_manager` -- in general we send all output to both SQLite and Parquet. Only the hourly tables (which can run into the hundreds of millions of rows) are written to Parquet alone.
- I don't think this asset will be high memory -- with just annual data and 10 years of forecasts for each planning area it'll be a pretty small table.
Making these updates now.
In terms of when to use `pudl_io_manager` vs `parquet_io_manager`, is it basically a judgment call / depends on how the dev work is going? As in, if the transformation we're trying to run is going to output data of at least a certain size, then we go with Parquet? Similar question for tags like `memory-use` -- is it sort of on a case-by-case basis?
Right now we use `pudl_io_manager` for everything with the exception of the hourly time series, which can be quite long. For hourly tables we use `parquet_io_manager`. Not all of the hourly tables really need to only be in Parquet, but enough of them do that it seemed simpler to make this clear distinction.
"High" memory use is any transform that has a peak memory usage that's higher than one CPU's fair share of the memory on the nightly build machine (16 CPUs and 64GB of memory, so anything larger than 4GB). Identifying which assets fall into this category is haphazard right now -- I basically ran all of the ones I thought might be high memory and watched resource consumption on my laptop. I'm sure we could do some real profiling and get a more comprehensive answer.
I think this is correct. We rely pretty heavily on running a "fast" ETL for integration testing, which I think should exercise your new code, since the FERC-714 module is already being scanned for assets by Dagster.
We are just starting the process of converting the validation tests into Dagster asset checks, so that they run during the ETL instead of infrequently and after the fact. @jdangerx has prototyped this kind of check.
Also yes, let me run some sanity checks and viz. And I'll take a look at the Dagster asset checks you mentioned.
Added some row count checks in my latest commit.
Just caught an issue that I'm handling now.
@zaneselvans I've added some initial analysis in a notebook here. Seeing a couple interesting things so far. Let me know what you think. One thing that's a bit concerning: not all the respondents provided 10 years' worth of predictions in every report year. There were only a few instances where this is the case. Not sure if we want to do something specific about this, but wanted to flag it with you just in case.
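The completeness issue flagged above is easy to surface with a groupby: count how many distinct forecast years each respondent reported per report year and flag anything other than 10. The column names below are guesses at the table's schema, and the toy frame is illustrative.

```python
# Sketch: flag (respondent, report_year) pairs with != 10 forecast years.
# Column names are assumed, not confirmed against the actual table schema.
import pandas as pd

forecasts = pd.DataFrame(
    {
        "respondent_id_ferc714": [1] * 10 + [2] * 8,
        "report_year": [2020] * 18,
        "forecast_year": list(range(2021, 2031)) + list(range(2021, 2029)),
    }
)

years_reported = (
    forecasts.groupby(["respondent_id_ferc714", "report_year"])["forecast_year"]
    .nunique()
    .rename("n_forecast_years")
)
incomplete = years_reported[years_reported != 10]
print(incomplete)  # respondent 2 reported only 8 forecast years
```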
Huh, that's weird that they don't always report 10 years of data, but it's pretty rare and also not crazy in the context of how bad a lot of the FERC data is. I wouldn't worry about it for the moment. It looks like there's an alembic migration error now, probably due to the newly merged changes. I'm curious if you tried either of the asset loading methods I had suggested, and if they worked or didn't work.

```python
from dagster import AssetKey

from pudl.etl import defs

demand_forecasts = defs.load_asset_value(
    AssetKey("core_ferc714__yearly_planning_area_demand_forecast")
)
```

```python
import pandas as pd

from pudl.workspace.setup import PudlPaths

pudl_paths = PudlPaths()
demand_forecasts = pd.read_parquet(
    pudl_paths.parquet_path("core_ferc714__yearly_planning_area_demand_forecast"),
    dtype_backend="pyarrow",
)
```
@seeess1 and @zaneselvans I took a look at the alembic issue and I think it should be a pretty easy fix. You should be able to use the following commands to satisfy alembic:
Then, you can
Thanks a ton @zschira! Just ran the commands you shared and pushed up the changes. Will keep an eye on the latest build to see if the errors go away. @zaneselvans I tried loading the new asset other ways, including the ones you posted above, but none are working for me :(. Here's what I get when I try the first one:

And here's what I get when I try the second one:

Maybe there's something very simple that I need to do, but I'm not sure. Re: some respondents not reporting all 10 years of data: sounds good. I'll keep the data as-is. In the meantime, I'll play around a little more with the data we have to try and compare it with the actual usage for these respondents.
Hmm, that seems odd. I'd really like to figure out what's up so we can help other folks avoid it. A few questions:

```shell
mamba list | grep catalystcoop
echo $DAGSTER_HOME
echo $PUDL_OUTPUT
```

If you run python at the command line with the
Ah, I just tried running the integration tests locally and they failed because the assertion about duplicate rows is only true when you process all years of data. Probably want to set an upper bound on dupes rather than an exact number.
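The fix suggested above (a ceiling on duplicates rather than an exact count) keeps the assertion true on both fast, single-year runs and full ETL runs. A minimal sketch, where `MAX_DUPES` and the toy frame are illustrative rather than numbers from this PR:

```python
# Sketch: bound the number of duplicate rows instead of asserting an exact
# count, so the check passes regardless of how many years are processed.
import pandas as pd

MAX_DUPES = 100  # illustrative ceiling, not a number from the PR

df = pd.DataFrame(
    {"respondent_id_ferc714": [1, 1, 2], "report_year": [2020, 2020, 2020]}
)
n_dupes = int(df.duplicated().sum())
assert n_dupes <= MAX_DUPES, f"{n_dupes} duplicate rows exceeds ceiling {MAX_DUPES}"
```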
Good catch. Updated here: |
Okay, that looks about like what I would expect to see in VS Code. If it's running on Python 3.12.3 instead of 3.12.4, it's probably quite stale though. We update the dependency lockfile every Monday, so the first thing I would try is rebuilding the environment and restarting VS Code.
I disabled the
Overview
NOTE: This is still a WIP. Just putting up a PR for now to make sure I'm on the right track and to get input.
Closes #3519.
What problem does this address?
This creates a core asset out of the `raw_ferc714__yearly_planning_area_forecast_demand` data.

What did you change?
(Still in progress)
Testing
(Still in progress)
How did you make sure this worked? How can a reviewer verify this?
To-do list